Class Workbook

In class activity

Ames Housing data

Please take a look at the Ames Hoursing data.

library(AmesHousing)
?ames_raw

The goal of this exercise is to predict the price of the house.

Here is a histogram of the sales price with red line showing the mean.

ggplot(ames_raw)+geom_histogram(fill="skyblue")+aes(x=SalePrice)+
  geom_vline(xintercept = mean(ames_raw$SalePrice),col="red")

summary(ames_raw$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000

Initial linear model without a predictor

lmfit_null<- lm(SalePrice~1,ames_raw)
summary(lmfit_null)
## 
## Call:
## lm(formula = SalePrice ~ 1, data = ames_raw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -168007  -51296  -20796   32704  574204 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   180796       1476   122.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79890 on 2929 degrees of freedom

How good is this result? Let’s look at RMSE.

sqrt(mean(residuals(lmfit_null)^2))
## [1] 79873.06

Since the price is right skewed lets log transformation the outcome

ggplot(ames_raw)+geom_histogram(fill="skyblue")+aes(x=SalePrice)+geom_vline(xintercept = exp(mean(log(ames_raw$SalePrice))),col="red")+scale_x_log10()

Fitting the same model on the log transformed outcome

lmfit_null_log<- lm(log(SalePrice)~1,ames_raw)
summary(lmfit_null_log)
## 
## Call:
## lm(formula = log(SalePrice) ~ 1, data = ames_raw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.56463 -0.24953 -0.03804  0.25042  1.51350 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.02097    0.00753    1596   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4076 on 2929 degrees of freedom

RMSE is

sqrt(mean((ames_raw$SalePrice-exp(predict(lmfit_null_log)))^2))
## [1] 81195.11

Notice that the RMSE is actually bigger with log transformed model. So should we not transform? What do we get from the transformation?

Here is the prediction uncertainty overlayed on the histogram.

intpred=predict(lmfit_null,interval="prediction")
## Warning in predict.lm(lmfit_null, interval = "prediction"): predictions on current data refer to _future_ responses
ggplot(ames_raw)+geom_histogram(fill="skyblue")+aes(x=SalePrice)+geom_vline(xintercept = mean(ames_raw$SalePrice),col="red")+geom_vline(xintercept = intpred[1,2],col="red",lty=2)+geom_vline(xintercept = intpred[1,3],col="red",lty=2)

intpredlog=predict(lmfit_null_log,interval="prediction")
## Warning in predict.lm(lmfit_null_log, interval = "prediction"): predictions on current data refer to _future_ responses
ggplot(ames_raw)+geom_histogram(fill="skyblue")+aes(x=SalePrice)+geom_vline(xintercept = exp(mean(log(ames_raw$SalePrice))),col="red")+scale_x_log10()+
  geom_vline(xintercept = exp(intpredlog[1,2]),col="red",lty=2)+
  geom_vline(xintercept = exp(intpredlog[1,3]),col="red",lty=2)

The log model seem to have a better uncertainty estimate. What good does that do?

Let’s say the model is for an algorithm that buys the house. If you pay more than the true price the company buys. If the price is lower, then the company fails to buy. - If you bought for more than the true value you’ve over paid. - If you bid less and lost, you lost a profit of the 10% of the house price.

Based on such loss function what is our overall loss if we base our decision on this model?

allres<-residuals(lmfit_null)
abs(sum(allres[allres<0]))+sum(0.1*(coef(lmfit_null)+allres[allres>0]))
## [1] 114215680
allreslog<-ames_raw$SalePrice-exp(predict(lmfit_null_log))
abs(sum(allreslog[allreslog<0]))+sum(0.1*(exp(coef(lmfit_null_log))+allreslog[allreslog>0]))
## [1] 94211655

As you can see with a better calibrated model you have a better performance for more realistic loss.

Adding predictor Gr Liv Area

We add a predictor Gr Liv Area

p=ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)
p3 <- ggMarginal(p, margins = 'both', fill="skyblue", size=4,type="histogram")
p3

Using Gr Liv Area as predictor

lmfit_liv_area_0<- lm(SalePrice~`Gr Liv Area`,ames_raw)
lmfit_liv_area_1<- lm(log(SalePrice)~`Gr Liv Area`,ames_raw)
lmfit_liv_area_2<- lm(log(SalePrice)~log(`Gr Liv Area`),ames_raw)
gridExtra::grid.arrange(
ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE),
ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10(),
ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10(),
qplot(predict(lmfit_liv_area_0),residuals(lmfit_liv_area_0))+geom_smooth(method="rlm")+geom_hline(yintercept = 0,lty=2),
qplot(predict(lmfit_liv_area_1),residuals(lmfit_liv_area_1))+geom_smooth(method="rlm")+geom_hline(yintercept = 0,lty=2),
qplot(predict(lmfit_liv_area_2),residuals(lmfit_liv_area_2))+geom_smooth(method="rlm")+geom_hline(yintercept = 0,lty=2),
ncol=3
)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Computation failed in `stat_smooth()`
## Computation failed in `stat_smooth()`
## Computation failed in `stat_smooth()`
## Caused by error in `get()`:
## ! object 'rlm' of mode 'function' was not found

Because of the skewness it’s better to take log on both x and y.

p=ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10()
p3 <- ggMarginal(p, margins = 'both', fill="skyblue", size=4,type="histogram")
p3

lm_mod_1 <- lm(log(SalePrice)~log(`Gr Liv Area`),ames_raw)
summary(lm_mod_1)
## 
## Call:
## lm(formula = log(SalePrice) ~ log(`Gr Liv Area`), data = ames_raw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0778 -0.1465  0.0264  0.1740  0.8602 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.43019    0.11644   46.63   <2e-16 ***
## log(`Gr Liv Area`)  0.90781    0.01602   56.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2816 on 2928 degrees of freedom
## Multiple R-squared:  0.523,  Adjusted R-squared:  0.5228 
## F-statistic:  3210 on 1 and 2928 DF,  p-value: < 2.2e-16

However, the residual still shows heterogeneous spread.

plot(predict(lm_mod_1),resid(lm_mod_1),main="residual plot");abline(h=0,lty=2,col="grey")
library(quantreg)
qu<-rq(resid(lm_mod_1)~predict(lm_mod_1),tau = c(0.1, 0.9))

colors <- c("#ffe6e6", "#ffcccc", "#ff9999", "#ff6666", "#ff3333",
            "#ff0000", "#cc0000", "#b30000", "#800000", "#4d0000", "#000000")
for (j in 1:ncol(qu$coefficients)) {
    abline(coef(qu)[, j], col = colors[j])
}

Did we reduce the residual variability?

logSalePrice<-log(ames_raw$SalePrice)
clogSalePricec <-logSalePrice-mean(logSalePrice)
rlogSalePricec<-resid(lm_mod_1)
labs<-c("mean","regression")
names(labs)<-c("clogSalePricec","rlogSalePricec")
ggplot(melt(data.frame(clogSalePricec,rlogSalePricec)))+
  geom_histogram(fill="skyblue")+
  aes(x=value,color=variable)+geom_vline(xintercept = 0,col="red")+
  facet_grid(variable~.,labeller = labeller(variable = labs))

anova(lm_mod_1)
## Analysis of Variance Table
## 
## Response: log(SalePrice)
##                      Df Sum Sq Mean Sq F value    Pr(>F)    
## log(`Gr Liv Area`)    1 254.47 254.470    3210 < 2.2e-16 ***
## Residuals          2928 232.12   0.079                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at correlation with predictors.

#`Lot Area`,`Year Built`,`Year Remod/Add`,`Total Bsmt SF`,`1st Flr SF`,`2nd Flr SF`,`Low Qual Fin SF`,`Gr Liv Area`,`Full Bath`,`Half Bath`,`Bsmt Full Bath`,`Bsmt Half Bath`,`TotRmsAbvGrd`,`Garage Area`,`Wood Deck SF`,`Open Porch SF`,`Pool Area`,`Sale Type`
M <-cor(ames_raw[,c("Lot Area","Year Built","Year Remod/Add","Total Bsmt SF","1st Flr SF","2nd Flr SF","Low Qual Fin SF","Gr Liv Area","Full Bath","Half Bath","Bsmt Full Bath","Bsmt Half Bath","TotRms AbvGrd","Garage Area","Wood Deck SF","Open Porch SF","Pool Area","SalePrice")],use="pairwise.complete.obs")
corrplot(M, type="upper", order="hclust",
         col=brewer.pal(n=8, name="RdYlBu"))

mames<-melt(ames_raw[,c("Lot Area","Year Built","Year Remod/Add","Total Bsmt SF","1st Flr SF","2nd Flr SF","Low Qual Fin SF","Gr Liv Area","Full Bath","Half Bath","Bsmt Full Bath","Bsmt Half Bath","TotRms AbvGrd","Garage Area","Wood Deck SF","Open Porch SF","Pool Area","SalePrice")],id.vars = "SalePrice")
ggplot(mames)+geom_point()+aes(x=value,y=SalePrice)+facet_wrap(variable~.,scales = "free")
## Warning: Removed 6 rows containing missing values (`geom_point()`).

thinking about the Lot Area

When looking at lot area, its no surprise to have some relationship with the price. But the relationship is not clear linear one. Why?

ggplot(ames_raw)+geom_point()+aes(x=`Lot Area`,y=SalePrice)+xlab("Lot size in square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10()#+facet_grid(~`Bedroom AbvGr`)

If you look at this by the neighborhood it become obvious how in some places size matters more than others.

ggplot(ames_raw)+geom_point(alpha=0.3)+aes(x=`Lot Area`,y=SalePrice,color=factor(Neighborhood))+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10()+ theme(legend.position = "none")+facet_wrap(~Neighborhood)

Prediction of future price based on data upto 2008

To make the project more realistic, I will split the data into before 2008 and after. The data up to 2008 will be the training data nd after will be the testing data.

ames_raw_2008=ames_raw[ames_raw$`Yr Sold`<2008,]
ames_raw_2009=ames_raw[ames_raw$`Yr Sold`>=2008,]

If you look at the time trend, it seems the price is fairly stable over the years.

ames_raw$saledt <- ym(paste0(ames_raw$`Yr Sold`,"-",ames_raw$`Mo Sold`))

rolling_median <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}

predict.rollmed <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}
p=ggplot(ames_raw)+geom_point()+aes(x=saledt,y=SalePrice)+xlab("Sale Date")+ylab("Sale Price")+scale_y_log10()+geom_smooth(formula = y ~ x, method = "rolling_median",se=FALSE)+geom_vline(xintercept=ym("2008-01"),lty=2,col="orange")
p3 <- ggMarginal(p, margins = 'y', fill="skyblue", size=4,type="histogram")
p3

Fitting the null model on the training data

lmfit_null_2008     <- lm(SalePrice~1,ames_raw_2008)
lmfit_null_log_2008 <- lm(log(SalePrice)~1,ames_raw_2008)

Comparing the MSE

sqrt(mean((ames_raw_2009$SalePrice-predict(lmfit_null,newdata=ames_raw_2009))^2))
## [1] 77587.03
sqrt(mean((ames_raw_2009$SalePrice-exp(predict(lmfit_null_log,newdata=ames_raw_2009)))^2))
## [1] 78531.29

Comparing the business loss

allres_2009=ames_raw_2009$SalePrice-predict(lmfit_null,newdata=ames_raw_2009)
abs(sum(allres_2009[allres_2009<0]))+sum(0.1*(coef(lmfit_null_2008)+allres_2009[allres_2009>0]))
## [1] 63806382
allreslog_2009<-(ames_raw_2009$SalePrice-exp(predict(lmfit_null_log,newdata=ames_raw_2009)))
abs(sum(allreslog_2009[allreslog_2009<0]))+sum(0.1*(exp(coef(lmfit_null_log_2008))+allreslog_2009[allreslog_2009>0]))
## [1] 52684502

In class activity

Use data of ames_raw up to 2008 predict the housing price for the later years.

ames_raw_2008=ames_raw[ames_raw$`Yr Sold`<2008,]
ames_raw_2009=ames_raw[ames_raw$`Yr Sold`>=2008,]

Use the following loss function calculator.

calc_loss<-function(prediction,actual){
  difpred <- actual-prediction
  difpred = na.omit(difpred)
  RMSE <-sqrt(mean(difpred^2))
  operation_loss<-abs(sum(difpred[difpred<0]))+sum(0.1*actual[difpred>0])
  return(
    list(RMSE,operation_loss
         )
  )
}

Here are few rules: - You are not allowed to use the test data. - You cannot use automatic variable selection. - You need to explain why you added each variable.

lmfit_2008<- lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` +  log(`Gr Liv Area`) + `Bsmt Full Bath`  + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` , data = ames_raw_2008) # use ames_raw_2008
lmfit_2008
## 
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` + 
##     `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + log(`Gr Liv Area`) + 
##     `Bsmt Full Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + 
##     `Open Porch SF`, data = ames_raw_2008)
## 
## Coefficients:
##        (Intercept)     log(`Lot Area`)        `Year Built`    `Year Remod/Add`  
##         -5.573e+00           8.383e-02           2.898e-03           3.078e-03  
##    `Total Bsmt SF`        `1st Flr SF`  log(`Gr Liv Area`)    `Bsmt Full Bath`  
##          1.588e-04          -4.433e-05           6.828e-01           6.376e-02  
##    `TotRms AbvGrd`       `Garage Area`      `Wood Deck SF`     `Open Porch SF`  
##         -3.040e-02           2.378e-04           1.167e-04          -1.455e-04

When you decide on your model use the following to come up with your test loss.

pred_2009<-exp(predict(lmfit_2008,newdata=ames_raw_2009))
calc_loss(pred_2009,ames_raw_2009$SalePrice)

Try to answer the following additional questions.

  • Does your model indicate a good fit? If not where is the fit off?

Your code:

summary(lmfit_2008)
## 
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` + 
##     `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + log(`Gr Liv Area`) + 
##     `Bsmt Full Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + 
##     `Open Porch SF`, data = ames_raw_2008)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48256 -0.08492  0.00058  0.09723  0.68601 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.573e+00  5.456e-01 -10.214  < 2e-16 ***
## log(`Lot Area`)     8.383e-02  1.094e-02   7.661 3.57e-14 ***
## `Year Built`        2.898e-03  2.298e-04  12.614  < 2e-16 ***
## `Year Remod/Add`    3.078e-03  3.153e-04   9.762  < 2e-16 ***
## `Total Bsmt SF`     1.588e-04  1.967e-05   8.076 1.50e-15 ***
## `1st Flr SF`       -4.433e-05  2.275e-05  -1.949  0.05150 .  
## log(`Gr Liv Area`)  6.828e-01  2.934e-02  23.276  < 2e-16 ***
## `Bsmt Full Bath`    6.376e-02  1.007e-02   6.331 3.35e-10 ***
## `TotRms AbvGrd`    -3.040e-02  5.274e-03  -5.764 1.02e-08 ***
## `Garage Area`       2.378e-04  2.996e-05   7.938 4.41e-15 ***
## `Wood Deck SF`      1.167e-04  3.937e-05   2.964  0.00309 ** 
## `Open Porch SF`    -1.455e-04  7.253e-05  -2.005  0.04513 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1721 on 1306 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8178, Adjusted R-squared:  0.8163 
## F-statistic: 532.9 on 11 and 1306 DF,  p-value: < 2.2e-16
#

Your answer:

It is a good model as all the predictors are statistically significant and the Adjusted R-squared is .8163 which is pretty high. 
  • Should you include all the predictors? Why?

Your code:

#
lmfit_2008all = lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `2nd Flr SF`+ `Low Qual Fin SF` + log(`Gr Liv Area`) + `Full Bath` + `Half Bath`+ `Bsmt Full Bath` + `Bsmt Half Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` + `Pool Area`, data = ames_raw_2008)
pred_2009<-exp(predict(lmfit_2008all,newdata=ames_raw_2009))
calc_loss(pred_2009,ames_raw_2009$SalePrice)
## [[1]]
## [1] 43840.48
## 
## [[2]]
## [1] 32470209
summary(lmfit_2008all)
## 
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` + 
##     `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `2nd Flr SF` + 
##     `Low Qual Fin SF` + log(`Gr Liv Area`) + `Full Bath` + `Half Bath` + 
##     `Bsmt Full Bath` + `Bsmt Half Bath` + `TotRms AbvGrd` + `Garage Area` + 
##     `Wood Deck SF` + `Open Porch SF` + `Pool Area`, data = ames_raw_2008)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52376 -0.08264  0.00067  0.09629  0.76399 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.809e+00  7.184e-01  -8.086 1.40e-15 ***
## log(`Lot Area`)     8.272e-02  1.099e-02   7.529 9.54e-14 ***
## `Year Built`        2.976e-03  2.644e-04  11.259  < 2e-16 ***
## `Year Remod/Add`    3.139e-03  3.199e-04   9.812  < 2e-16 ***
## `Total Bsmt SF`     1.572e-04  2.000e-05   7.862 7.89e-15 ***
## `1st Flr SF`       -3.392e-05  4.335e-05  -0.783  0.43405    
## `2nd Flr SF`        1.850e-05  3.895e-05   0.475  0.63486    
## `Low Qual Fin SF`  -1.491e-04  1.158e-04  -1.288  0.19810    
## log(`Gr Liv Area`)  6.794e-01  6.135e-02  11.075  < 2e-16 ***
## `Full Bath`        -1.313e-02  1.427e-02  -0.920  0.35760    
## `Half Bath`        -7.569e-03  1.438e-02  -0.526  0.59863    
## `Bsmt Full Bath`    6.460e-02  1.051e-02   6.144 1.07e-09 ***
## `Bsmt Half Bath`    2.497e-02  1.912e-02   1.306  0.19174    
## `TotRms AbvGrd`    -3.000e-02  5.418e-03  -5.538 3.70e-08 ***
## `Garage Area`       2.371e-04  3.005e-05   7.891 6.34e-15 ***
## `Wood Deck SF`      1.080e-04  3.997e-05   2.703  0.00695 ** 
## `Open Porch SF`    -1.416e-04  7.339e-05  -1.930  0.05381 .  
## `Pool Area`         1.631e-05  1.019e-04   0.160  0.87290    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1722 on 1300 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8185, Adjusted R-squared:  0.8161 
## F-statistic: 344.8 on 17 and 1300 DF,  p-value: < 2.2e-16

Your answer:

lmfit_2008some = lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `Low Qual Fin SF` + log(`Gr Liv Area`) + `Bsmt Full Bath` + `Bsmt Half Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` , data = ames_raw_2008)
pred_2009<-exp(predict(lmfit_2008some,newdata=ames_raw_2009))
calc_loss(pred_2009,ames_raw_2009$SalePrice)
## [[1]]
## [1] 43015.08
## 
## [[2]]
## [1] 32494178
summary(lmfit_2008some)
## 
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` + 
##     `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `Low Qual Fin SF` + 
##     log(`Gr Liv Area`) + `Bsmt Full Bath` + `Bsmt Half Bath` + 
##     `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF`, 
##     data = ames_raw_2008)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.50374 -0.08347 -0.00057  0.09616  0.76774 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.567e+00  5.468e-01 -10.181  < 2e-16 ***
## log(`Lot Area`)     8.280e-02  1.095e-02   7.561 7.51e-14 ***
## `Year Built`        2.861e-03  2.314e-04  12.368  < 2e-16 ***
## `Year Remod/Add`    3.101e-03  3.153e-04   9.835  < 2e-16 ***
## `Total Bsmt SF`     1.577e-04  1.972e-05   7.998 2.76e-15 ***
## `1st Flr SF`       -4.636e-05  2.276e-05  -2.037  0.04185 *  
## `Low Qual Fin SF`  -1.558e-04  1.121e-04  -1.390  0.16481    
## log(`Gr Liv Area`)  6.868e-01  2.938e-02  23.372  < 2e-16 ***
## `Bsmt Full Bath`    6.686e-02  1.025e-02   6.520 1.00e-10 ***
## `Bsmt Half Bath`    2.653e-02  1.894e-02   1.401  0.16158    
## `TotRms AbvGrd`    -2.979e-02  5.283e-03  -5.640 2.09e-08 ***
## `Garage Area`       2.374e-04  3.001e-05   7.913 5.35e-15 ***
## `Wood Deck SF`      1.102e-04  3.956e-05   2.786  0.00541 ** 
## `Open Porch SF`    -1.418e-04  7.251e-05  -1.955  0.05079 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.172 on 1304 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8183, Adjusted R-squared:  0.8165 
## F-statistic: 451.9 on 13 and 1304 DF,  p-value: < 2.2e-16
I tried delete some predictor which has a high p value in the lmfit_2008all model(model with all predictor), however the loss is higher than before. So I think the lmfit_2008all model(model with all predictor) can best predict the price. While I got the highest Adjusted R-squared in lmfit_2008some after delete several predictor. 
  • What interaction makes sense? Does your model indicate signs of interaction?

Your code:

lmfit_2008some = lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `Low Qual Fin SF` + log(`Gr Liv Area`) + `Bsmt Full Bath` + `Bsmt Half Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` , data = ames_raw_2008)
pred_2009<-exp(predict(lmfit_2008some,newdata=ames_raw_2009))
calc_loss(pred_2009,ames_raw_2009$SalePrice)
## [[1]]
## [1] 43015.08
## 
## [[2]]
## [1] 32494178

Your answer:

All the predictors are statistically significant, I don't know how to add a interaction. 
  • Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor, fit a model with polynomial terms up to 3rd order.

Your code:

#
#

Your answer:

Please write your answer in full sentences.
  • What are the top 5 houses that are most over priced based on your model?

Your code:

#
#

Your answer:

Please write your answer in full sentences.
  • What are the top 5 most good deal based on your model?

Your code:

#
#

Your answer:

Please write your answer in full sentences.

Problem Set

[Problems] Linear Regression Problems

  1. This question involves the use of simple linear regression on the Auto data set.
  1. Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. mpg(miles per gallon)

Your code:

Auto = read.csv('Auto.csv')
lmfitAuto = lm(mpg ~ horsepower, data = Auto)
summary(lmfitAuto)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Your answer:

Horsepower is statistically significant. When increase one horsepower, mpg is expected to decrease 0.158. 

Comment on the output. For example: i. Is there a relationship between the predictor and the response?

  1. How strong is the relationship between the predictor and the response? More than 99% confidence.
  2. Is the relationship between the predictor and the response positive or negative? Negative
  3. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
predicted_mpg <- predict(lmfitAuto, newdata = data.frame(horsepower = 98), interval = "confidence")
predicted_mpg
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
  1. Plot the response and the predictor. Use the abline() function to display the least squares regression line.

Your code:

ggplot(Auto)+
  geom_point()+
  aes(x = horsepower, y = mpg)+
  xlab("horsepower")+
  ylab("mpg")+
  geom_abline(slope = -0.158, intercept = 39.93, col = "blue", size = 1)+
  theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#

Your answer:

Please write your answer in full sentences.
  1. Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

Your code:

plot(lmfitAuto)

Your answer:

These plot show that the residual is not a normal distribution which mean that the model could be improved. 
  1. This question involves the use of multiple linear regression on the Auto data set.
  1. Produce a scatterplot matrix which includes all of the variables in the data set.

Your code:

pairs(subset(Auto, select = -name), pch = 1)

Your answer:

From this graph, there is obvious correlation between several predictors. 
  1. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, cor() which is qualitative.

Your code:

cor(subset(Auto, select = -name))
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
#

Your answer:

High corelation between displacement, cylinders, horsepower and weight. 
  1. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.

Your code:

lmfitAutoall = lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = Auto)
summary(lmfitAutoall)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Your answer:

Displacement, weight, year, origin are significant while others are not. There is a relationship betweeen the predictors and the response as the R square is high. 

Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response?

  1. What does the coefficient for the year variable suggest? The year is statistically significant. When year increase 1, mpg is expected to increased 0.75.
  1. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Your code:

plot(lmfitAutoall)

#

Your answer:

    325, 324, 321 are unusual outliers. 14 has a high leverage. 
  1. Use the * and : symbols to fit linear regression models with interaction effects. Why did you add the interaction? Explain using the context of the problem. Do any interactions appear to be statistically significant?

Your code:

# Fit a linear regression model with interactions between two predictors
lmAuto_i <- lm(mpg ~ cylinders * displacement + 
                                    cylinders * horsepower + 
                                    cylinders * weight + 
                                    cylinders * acceleration + 
                                    cylinders * year + 
                                    cylinders * origin +
                                    displacement * horsepower + 
                                    displacement * weight + 
                                    displacement * acceleration + 
                                    displacement * year + 
                                    displacement * origin +
                                    horsepower * weight + 
                                    horsepower * acceleration + 
                                    horsepower * year + 
                                    horsepower * origin +
                                    weight * acceleration + 
                                    weight * year + 
                                    weight * origin +
                                    acceleration * year + 
                                    acceleration * origin +
                                    year * origin, 
                            data = Auto)
summary(lmAuto_i)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + cylinders * horsepower + 
##     cylinders * weight + cylinders * acceleration + cylinders * 
##     year + cylinders * origin + displacement * horsepower + displacement * 
##     weight + displacement * acceleration + displacement * year + 
##     displacement * origin + horsepower * weight + horsepower * 
##     acceleration + horsepower * year + horsepower * origin + 
##     weight * acceleration + weight * year + weight * origin + 
##     acceleration * year + acceleration * origin + year * origin, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

Your answer:

displacement:year, acceleration:year, acceleration:origin are significant. 
I add all the interactions and choose the interactions with low p value. 
  1. Try a few different transformations of the variables, such as \(log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.

This question should be answered using the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

Your code:

library(ISLR2)
lmCarseats = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lmCarseats)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Your answer:

It is not a good model as the R is too low. While p value of F is pretty high.
  1. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Your answer:

We failed to reject that urban won't influence Sales.
When US is yes, sales is expected to increase 1.2 compare to US is no when others fix. 
When price increase 1, sales is expected to decrease 0.05 when others fix. 
  1. Write out the model in equation form, being careful to handle the qualitative variables properly.

Your answer:

$$ \text{Sales} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{UrbanYes} + 1.200573 \times \text{USYes} + \varepsilon $$
  1. For which of the predictors can you reject the null hypothesis \(H_0 :\beta_j = 0\)?

Your answer:

Price and US. 
  1. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

Your code:

lmCarseats_s = lm(Sales ~ Price + US, data = Carseats)
summary(lmCarseats_s)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  1. How well do the models in (a) and (e) fit the data?

Your answer:

The model in (e) is a little bit better than the model in (a). But as the ajusted R square is too low, the model both didn't fit well. 
  1. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

Your code:

conf_intervals <- confint(lmCarseats_s)
conf_intervals
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Your answer:

Please write your answer in full sentences.
  1. Is there evidence of outliers or high leverage observations in the model from (e)?

Your answer:

plot(lmCarseats_s)

Yes, 69, 37, 51 are outliers and the point 26, 50, 368 are high leverage observations. 

Additional Material

K-nn regression

You can do KNN regression using FNN package. Read more about it here: https://daviddalpiaz.github.io/r4sl/knn-reg.html

library(FNN)
knn.reg(train = ?, test = ?, y = ?, k = ?)
  • train: the predictors of the training data
  • test: the predictor values, x, at which we would like to make predictions
  • y: the response for the training data
  • k: the number of neighbors to consider

[Advanced] Predictive Modeling Platforms in R

There are few platforms in R that does predictive modeling. These platforms are wrappers around other packages that makes it easy to do routine tasks.

# split the data
index <- sample(1:nrow(ames_raw), 0.7*nrow(ames_raw))
vars <- c("SalePrice","Lot Area","Gr Liv Area","Full Bath")
train <- ames_raw[ index, vars]
test  <- ames_raw[-index, vars]
colnames(train) <- make.names(colnames(train))
colnames(test)  <- make.names(colnames(test))

# mlr3 TaskRegr
train$SalePrice <- log(train$SalePrice)

Prediction using mlr3

MLR3 is a useful ML platform on R. You can learn about it here: https://mlr3book.mlr-org.com There is MLR and MLR3. It’s better to use 3 since MLR is no longer actively developed.

# load packages and data
library(mlr3)
library(mlr3learners)

# fit a model

task <- as_task_regr(train, target ="SalePrice",id = "ames_raw")

# TaskRegr$new(id = "ames_raw", backend = train, target ="SalePrice")
learner <- lrn("regr.lm", predict_type = "response")
learner$train(task)
prediction=predict(learner,newdata=test)

Prediction using tidymodels

Tidymodels is another one of the tidy family that is similar to mlr3. https://www.tidymodels.org/

# load packages and data
library(tidymodels)
library(dotwhisker)
# fit a model
rec <- recipe(SalePrice ~ ., data = train) 

clf <- linear_reg() 

wflow <- workflow() %>%
         add_recipe(rec) %>%
         add_model(clf)

lm_fit <- wflow %>% fit(data = train)
prediction=predict(lm_fit,new_data=test)
tidy(lm_fit) %>% 
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))

Prediction using caret

# load packages and data
library(caret)

# fit a model
ctrl <- trainControl(method = "none")
lm_model <- train(SalePrice ~ ., data = train, method = "lm", trControl = ctrl)

prediction=predict(lm_model,newdata = test)

Prediction using h2o

H2O is a cross platform library that is popular in the predictive modeling space. They work well out of box and can be called from any platform independent of the language used. Making it work on R could sometimes be a headache. So try using H2O but if you cannot make it work, I recommend you leave this section alone.

If you are on Mac you will need to install Java (http://www.java.com ).

install.packages("h2o")
# load packages and data
library(h2o)
packageVersion("h2o")
Starting H2O

To use H2O you need to instantiate it.

# nthreads specifies number of threads. -1 means use all the CPU cores.
# max_mem_size specifies the maximum amount of RAM to use.
localH2O <- h2o.init(nthreads = -1, max_mem_size="4g")

You can access H2O instance using the web UI FLOW by typing http://localhost:54321 in your browser.

Serving the data to H2O

Since H2O is not in R, you need to tell it to use your data.

train_hf <- as.h2o(train)
test_hf <- as.h2o(test)
Fitting GLM
gaussian.fit = h2o.glm(y = "SalePrice",                               #response variable 
                       x = c("SalePrice","Lot.Area","Gr.Liv.Area","Full.Bath"),  #predictor variables
                      training_frame = train_hf,                  #data
                      family = "gaussian",lambda = 0)           #specify the dist. of y and penalty parameter: lambda
gaussian.fit

prediction=predict(gaussian.fit,newdata = test_hf)
h2o.exportFile(prediction, "/tmp/pred.csv", force = TRUE) #export prediction result as a file
Saving and loading the model
# save the model
model_path <- h2o.saveModel(object=gaussian.fit, path=getwd(), force=TRUE)
print(model_path)

# load the model
saved_model <- h2o.loadModel(model_path)  #extract the saved model
saved_model
Shut down H2O
h2o.shutdown(prompt =F) 

Advanced Content

Best Linear Unbiased Estimator

Euler and motion of planets

The idea of regression started with Euler(1949) when he was studying the motion of planets. He wanted to predict the location of Jupiter. Euler had 75 observations for a linear model with seven unknown constants; equivalently, he had 75 equations and seven unknowns.

Let’s translate this into a more familiar notation that we are used to. We will let \[\begin{eqnarray*} y_i=\varphi_i\verb|, |\boldsymbol{\beta}=\left(\begin{array}{c} \beta_1\\ \beta_2\\ \vdots\\ \beta_p \end{array}\right) \verb|, and | \mathbf{X}_i=\left(\begin{array}{c} x_{i1} \\ x_{i2} \\ \vdots\\ x_{ip} \\ \end{array}\right) \end{eqnarray*}\]

Then, we can write the equation as \[\begin{eqnarray*} y_i=\mathbf{X}_i^T\boldsymbol{\beta}=x_{i1}\beta_1+x_{i2}\beta_2+\cdots+x_{ip}\beta_p \verb| for |i=1,\cdots,n=75 \end{eqnarray*}\]

There are a few assumptions that went into the model

  • \(y_i\) is linear in \(\beta\)
  • But it can be non-linear in the variables.
  • The model is known to be correct a priori

From calculus, we know that when we have seven unknowns, we need seven equations to estimate those unknowns. Since Euler had 75 equations, he needed a way to convert those 75 equations into seven equations. The simplest thing to do is randomly choose seven equations out of 75 equations and solve for the unknowns using only the selected equations. But Laplace came up with a better idea: to combine 75 equations so that you get seven equations as weighted linear combinations of the 75 equations. That is, each equation is

\[\begin{eqnarray*} \sum^n_{i=1}w_iy_i=\sum^n_{i=1}w_i\mathbf{X}_i^T\boldsymbol{\beta} \end{eqnarray*}\]

If we combine all the weights into a matrix \[\begin{eqnarray*} \mathbf{W}=\left( \begin{array}{cccc} w_{11}&w_{12}&\cdots &w_{1p} \\ \vdots&\vdots&\ddots&\vdots\\ w_{n1}&w_{n2}&\cdots &w_{np} \\ \end{array} \right) \end{eqnarray*}\]

Then 75 equations \(Y_{75\times 1}=\mathbf{X}_{75\times 7}\boldsymbol{\beta}_{7\times1}\) can be expressed as \[\begin{eqnarray*} (\mathbf{W}^TY)_{7\times 1}=(\mathbf{W}^T\mathbf{X})_{7\times 7}\boldsymbol{\beta} \rightarrow (\mathbf{W}^T\mathbf{X})^{-1}(\mathbf{W}^TY)_{7\times 1}=\hat{\hspace{0pt}\boldsymbol{\beta}} \end{eqnarray*}\]

This makes us ask a new question: what \(\mathbf{W}\) do we want to get the optimal estimate of \(\boldsymbol{\beta}\)?

Special case for when \(p=1\)

Let us start with a case where we only have one unknown variable. we have \[\begin{eqnarray*} \mathbf{X}=\left[ \begin{array}{c} x_{1} \\ \vdots\\ x_{n} \\ \end{array} \right]\verb|, | Y=\left[ \begin{array}{c} y_{1} \\ \vdots\\ y_{n} \\ \end{array} \right]\verb|, and | \mathbf{W}=\left[ \begin{array}{c} w_{1} \\ \vdots\\ w_{n} \\ \end{array} \right] \end{eqnarray*}\]

In this case, we can combine \(n\) equations as \[\begin{eqnarray*} \sum^n_{i=1}w_iy_i=\sum^n_{i=1}w_i\mathbf{X}_i^T\boldsymbol{\beta} \end{eqnarray*}\] solving for \(\boldsymbol{\beta}\) gives us \[\begin{eqnarray*} \hat{\hspace{0pt}\boldsymbol{\beta}}&=&\frac{\sum_i^n w_iy_i}{\sum_i^n w_i x_i} \\ &=&\frac{\langle \mathbf{W}, Y\rangle}{\langle \mathbf{W}, \mathbf{X}\rangle} \end{eqnarray*}\]

What is the optimal \(\mathbf{W}\) in this case? Let’s formulate this problem in a statistical framework.

We need to make a few assumptions As already stated, we will have the following assumptions.

  • There is true model for \(\boldsymbol{\beta}_{true}\) \(Y=\mathbf{X}_i\boldsymbol{\beta}_{true}+\epsilon_i\)
  • \(E(\epsilon_i)=0\), \(Var(\epsilon_i)=\sigma^2\), and \(Cov(\epsilon_i,\epsilon_j)=0\) for all \(i,j=1,\cdots, n\)
  • \(\mathbf{X}_i\)’s are fixed by design

We will start with the estimating equation \[\begin{eqnarray*} \hat{\hspace{0pt}\boldsymbol{\beta}}&=&\frac{\langle \mathbf{W}, \mathbf{X}\boldsymbol{\beta}_{true}+\epsilon \rangle}{\langle \mathbf{W}, \mathbf{X}\rangle}\\ &=&\boldsymbol{\beta}_{true}+\frac{\langle \mathbf{W}, \epsilon \rangle}{\langle \mathbf{W}, \mathbf{X}\rangle}\\ \end{eqnarray*}\]

The Frequentist view of statistics is relevant for this case because when we fit a model, we want to use the procedure repeatedly. Thus, we can do hypothetical repeated sampling. For each repetition, we get \(\epsilon_i\)’s, from which we get \(Y\), which gives us different \(\hat{\hspace{0pt}\boldsymbol{\beta}}\).
Then \(E(\hat{\hspace{0pt}\boldsymbol{\beta}})\) is the long run average under repetition. \(Var(\hat{\hspace{0pt}\boldsymbol{\beta}})\) is long run variance under repetition.

From the assumptions, we know \[\begin{eqnarray*} E\left(\langle \mathbf{W}, \epsilon\rangle\right)&=&\sum^n_{i=1}E\left(w_i \epsilon_i\right)\\ &=&\sum^n_{i=1}w_i E(\epsilon_i)\\ &=&0 \end{eqnarray*}\] and \[\begin{eqnarray*} Var\left(\langle \mathbf{W}, \epsilon\rangle \right)&=&Var\left(\sum^n_{i=1}w_i \epsilon_i \right)\\ &=&\sum^n_{i=1}Var(w_i \epsilon_i)\\ &=&\sum^n_{i=1}w_i^2Var(\epsilon_i)\\ &=&\lVert w \rVert^2\sigma^2\\ \end{eqnarray*}\]

Therefore \[\begin{eqnarray*} E\left( \hat{\hspace{0pt}\boldsymbol{\beta}} \right) &=& E\left(\boldsymbol{\beta}_{true}+ \frac{\langle \mathbf{W} \ \epsilon\rangle }{\langle \mathbf{W}\ \mathbf{X}\rangle} \right)\\ &=&\boldsymbol{\beta}_{true}\\ \end{eqnarray*}\] \[\begin{eqnarray*} Var\left(\hat{\hspace{0pt}\boldsymbol{\beta}}\right) &=& Var\left(\boldsymbol{\beta}_{true}+ \frac{\langle \mathbf{W} \ \epsilon\rangle }{\langle \mathbf{W}\ \mathbf{X}\rangle }\right)\\ &=&Var\left(\frac{\langle \mathbf{W} \ \epsilon\rangle }{\langle \mathbf{W}\ \mathbf{X}\rangle }\right)\\ &=&\frac{\lVert \mathbf{W} \rVert^2\sigma^2}{|\langle \mathbf{W}\ \mathbf{X} \rangle |^2}\\ &=&\frac{\lVert \mathbf{W} \rVert^2\sigma^2}{(\lVert \mathbf{W}\lVert \lVert \mathbf{X} \lVert cos\theta)^2}\\ &=&\frac{\sigma^2}{\lVert \mathbf{X} \rVert^2 cos^2\theta} \end{eqnarray*}\] Thus we see in general \(Var(\hat{\hspace{0pt}\boldsymbol{\beta}})\) is minimized when \(cos^2\theta=1\), which is when \(\theta=0\), hence \(\mathbf{W}\propto \mathbf{X}\). Indicating that when \(\mathbf{W} = c\mathbf{X}_i\) for some constant \(c\), \(\hat{\hspace{0pt}\boldsymbol{\beta}}\) will have minimum variance.

Alternatively we can start from objective function \(R(\boldsymbol{\beta})=\sum^{n}_{i=1}{(y_i-\mathbf{X}_i\boldsymbol{\beta})^2}\) and try to find \(\boldsymbol{\beta}\) that minimizes \(R(\boldsymbol{\beta})\). \[\begin{eqnarray*} R(\boldsymbol{\beta})'&=&-2\sum^{n}_{i=1}{(y_i-x_i\boldsymbol{\beta})x_i}=0\\ &\Rightarrow&\sum^{n}_{i=1}{(x_i^2\boldsymbol{\beta})}=\sum^{n}_{i=1}{(y_ix_i)}\\ &\Rightarrow&\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}=\frac{\sum^{n}_{i=1}{(y_ix_i)}}{\sum^{n}_{i=1}{(x_i^2)}}\\ \end{eqnarray*}\] The resulting \(\boldsymbol{\beta}_{LS}\) is the least squares estimate of \(\boldsymbol{\beta}\). Which we see is a particular case of the general estimating equation when \(w_i=\mathbf{X}_i\). Because of this result, the Least Squared Estimate is called the best unbiased linear estimator (BLUE).

For General \(p\)

We can do the same for more general \(p\), for \[\begin{eqnarray*} \mathbf{X}=\left[ \begin{array}{cccc} x_{11}&x_{12}&\cdots &x_{1p} \\ \vdots&\vdots&\ddots&\vdots\\ x_{n1}&x_{n2}&\cdots &x_{np} \\ \end{array} \right]\verb|, | Y=\left[ \begin{array}{c} y_{1} \\ \vdots\\ y_{n} \\ \end{array} \right]\verb|, | \epsilon=\left[ \begin{array}{c} \epsilon_{1} \\ \vdots\\ \epsilon_{n} \\ \end{array} \right] \verb|, and | \boldsymbol{\beta}=\left[ \begin{array}{c} \beta_{1} \\ \vdots\\ \beta_{p} \\ \end{array} \right] \end{eqnarray*}\]

We still have the same assumption

  • There is true model for \(\boldsymbol{\beta}_{true}\), \(Y=\mathbf{X}\boldsymbol{\beta}_{true}+\epsilon\)
  • \(E(\epsilon)=0\), \(Var(\epsilon)=\sigma^2I\)
  • \(X\) is fixed by design

Again define objective function as \(R(\boldsymbol{\beta})=\sum^{n}_{i=1}{\lVert Y-\mathbf{X}\boldsymbol{\beta}\rVert^2}\), least square estimate of \(\boldsymbol{\beta}\), \(\boldsymbol{\beta}_{LS}\) is the one that minimizes this loss function. We can derive \(\boldsymbol{\beta}_{LS}\) via taking the derivative of \(R(\boldsymbol{\beta})\) with respect to \(\boldsymbol{\beta}\) and setting that equal to zero. Hence

\[\begin{eqnarray*} R(\boldsymbol{\beta})'&=&-2\mathbf{X}^{T}{(Y-\mathbf{X}\boldsymbol{\beta})}\\ &\Rightarrow&-2(\mathbf{X}^{T}Y-\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta})=0\\ &\Rightarrow&\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta}=\mathbf{X}^{T}Y\\ &\Rightarrow&\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y\\ \end{eqnarray*}\] So, our estimated value of \(Y\) in the space spanned by \(\mathbf{X}\) using the least squares estimate of the \(\boldsymbol{\beta}\) can be written as \[\begin{eqnarray*} \hat{\hspace{0pt}Y}=\mathbf{X}\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y\\ \end{eqnarray*}\]

Because \(\hat{\hspace{0pt}Y}\) lives in space spanned by \(X\) since \(X\boldsymbol{\beta}\) is linear combination of \(\mathbf{X}\), \(\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\) can be thought of as projection matrix that projects \(Y\) onto space spanned by \(\mathbf{X}\). This matrix is often denoted by capital letter \(H\) and called the hat matrix \(H=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\). The hat matrix is a symmetric matrix that implies \(H^T=H\) and \(H^2=H\).

Under our assumption, least square estimates have the following property. \[\begin{eqnarray*} E(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS})&=&E((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y)\\ &=&E((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}(\mathbf{X}\boldsymbol{\beta}_{true}+\epsilon))\\ &=&\boldsymbol{\beta}_{true}\\ \end{eqnarray*}\] \[\begin{eqnarray*} Var(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS})&=&Var((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y)\\ &=&Var((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}(\mathbf{X}\boldsymbol{\beta}_{true}+\epsilon))\\ &=&Var((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\epsilon)\\ &=&(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Var(\epsilon)((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T})^T\\ &=&(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\sigma^2I((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T})^T\\ &=&\sigma^2(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\\ &=&\sigma^2(\mathbf{X}^{T}\mathbf{X})^{-1}\\ \end{eqnarray*}\]

Going back to the general estimating equation, \[\begin{eqnarray*} \hat{\hspace{0pt}\boldsymbol{\beta}}=(\mathbf{W}^T\mathbf{X})^{-1}(\mathbf{W}^TY)&=&(\mathbf{W}^T\mathbf{X})^{-1}(\mathbf{W}^T\mathbf{X}\boldsymbol{\beta}_{true}+\epsilon)\\ &=&\boldsymbol{\beta}_{true}+(\mathbf{W}^T\mathbf{X})^{-1}\mathbf{W}^T\epsilon \end{eqnarray*}\] \[\begin{eqnarray*} E(\hat{\hspace{0pt}\boldsymbol{\beta}})&=&E(\boldsymbol{\beta}_{true}+(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\epsilon)\\ &=&E(\boldsymbol{\beta}_{true})+E((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\epsilon)\\ &=&\boldsymbol{\beta}_{true}+(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T E(\epsilon)\\ &=&\boldsymbol{\beta}_{true}\\ \end{eqnarray*}\] \[\begin{eqnarray*} Var(\hat{\hspace{0pt}\boldsymbol{\beta}})&=&Var(\boldsymbol{\beta}_{true}+(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\epsilon)\\ &=&Var((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\epsilon)\\ &=&(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T Var(\epsilon)((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T)^T\\ &=&(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T \sigma^2I((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T)^T\\ &=&\sigma^2(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T ((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T)^T\\ \end{eqnarray*}\]

If we denote \((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\) as \(A\) then \[\begin{eqnarray*} Var(\hat{\hspace{0pt}\boldsymbol{\beta}})&=&\sigma^2\lVert A\rVert^2\\ \end{eqnarray*}\]

We will use the following consequence of Cauchy-Schwarz inequality due to the symmetry of the hat matrix \[\begin{eqnarray*} \lVert Y\rVert^2 \geq \lVert \hat{\hspace{0pt}Y}\rVert^2 \Rightarrow Y^TY \geq (HY)^T(HY)=Y^THY \end{eqnarray*}\] which is true for all \(Y\).

Hence \[\begin{eqnarray*} A^TA&\geq&A^TH^THA \\ &=&(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T (\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T)^T\\ &=&(\mathbf{X}^T \mathbf{X})^{-1} \end{eqnarray*}\]

Therefore we have \(Var(\hat{\hspace{0pt}\boldsymbol{\beta}})\geq \sigma^2(\mathbf{X}^T \mathbf{X})^{-1}=Var(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS})\), telling us once again that the least squares archive the minimum variance of the general estimator estimate, proving the BLUE property of least squares estimate.

Model Complexity

Up to here, we assumed we knew that the true model is a linear combination of \(\mathbf{X}\), but we know that is not true in general. However, that still does not invalidate the use of least squares estimate, as Box’s quote has it, ``All models are wrong, but some are useful’’. So, let us look at how our model will behave under more general assumptions.

We will assume \(Y=f(\mathbf{X})+\epsilon\) where \(f\) is some unknown function, and \(E[\epsilon]=0\) and \(Var[\epsilon]=\sigma^2I\). Also we will assume \(\mathbf{X}_1,\mathbf{X}_2,\cdots,\mathbf{X}_p\) are orthonormal. We will not assume a linear model for \(f\), but we will try to estimate \(Y\) using the method of least squares. Thus \(\hat{\hspace{0pt}Y}=H(f+\epsilon)=\hat{\hspace{0pt}f}+\hat{\hspace{0pt}\epsilon}\). \(\hat{\hspace{0pt}f}=\mathbf{X}\boldsymbol{\beta}_{best}\) where \(\boldsymbol{\beta}_{best}\) is the best we can do knowing our true model \(f\) is not linear. Also \(\hat{\hspace{0pt}\epsilon}=\mathbf{X}\delta\) where \(\delta=[\delta_1,\delta_2,\cdots,\delta_p]^T\) and \(E[\delta_i]=0\), \(Var(\delta_i)=\sigma^2\) hence \(E(\epsilon)=0\) and \(E\left[\lVert\epsilon\rVert^2\right]=E\left[ \lVert \mathbf{X}_1\rVert^2\delta_1^2+\cdots+\lVert\mathbf{X}_p\rVert^2\delta_p^2\right]=p\sigma2^2\) which follows since \(\lVert \mathbf{X}_i\rVert^2=1\) from orthonormal assumption we made earlier.

obs predictor f randomn in train training data randomness in test testing data
1 \(x_{11}\) \(x_{12}\) \(\cdots\) \(x_{1p}\) \(f_1\) \(\epsilon_1\) \(y_1=f_1+\epsilon_1\) \(\epsilon_{1,new}\) \(y_{1,new}=f_1+\epsilon_{1,new}\)
2 \(x_{21}\) \(x_{22}\) \(\cdots\) \(x_{2p}\) \(f_2\) \(\epsilon_2\) \(y_2=f_2+\epsilon_2\) \(\epsilon_{2,new}\) \(y_{2,new}=f_2+\epsilon_{2,new}\)
\(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
n \(x_{n1}\) \(x_{n2}\) \(\cdots\) \(x_{np}\) \(f_n\) \(\epsilon_n\) \(y_n=f_n+\epsilon_n\) \(\epsilon_{n,new}\) \(y_{n,new}=f_n+\epsilon_{n,new}\)
——— ———– ———- ———- ———- ——— ————– ———————- ——————- ———————————
  \(\mathbf{X}_{1}\) \(\mathbf{X}_{2}\) \(\cdots\) \(\mathbf{X}_{p}\) \(f\) \(\epsilon\) \(Y=f+\epsilon\) \(\epsilon_{new}\) \(Y_{new}=f+\epsilon_{new}\)

Under this setting, let’s look at the training and testing errors and their behavior. - Training Error \[\begin{eqnarray*} E\left[\lVert Y-\hat{\hspace{0pt}Y}\rVert^2\right]&=&E\left[\lVert (f+\epsilon)-(\hat{\hspace{0pt}f}+\hat{\hspace{0pt}\epsilon})\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})+(\epsilon-\hat{\hspace{0pt}\epsilon})\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})\rVert^2+\lVert(\epsilon-\hat{\hspace{0pt}\epsilon})\rVert^2-2\langle(f-\hat{\hspace{0pt}f})\ (\epsilon-\hat{\hspace{0pt}\epsilon})\rangle\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+E\left[\lVert\epsilon\rVert^2-\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2-p\sigma^2\\ \end{eqnarray*}\]

Which follows from the fact that \(\hat{\hspace{0pt}\epsilon}\) is the projection of \(\epsilon\),

\[\begin{eqnarray*} E\left[\lVert\epsilon-\hat{\hspace{0pt}\epsilon}\rVert^2\right]&=&E\left[\lVert\epsilon\rVert^2\right]+E\left[\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\right] -2E\left[\langle \epsilon\ \hat{\hspace{0pt}\epsilon}\rangle\right]\\ &=&n\sigma^2+p\sigma^2-2p\sigma^2\\ &=&n\sigma^2-p\sigma^2\\ \end{eqnarray*}\] Thus we get the Pythagoras Theorem \(\lVert\epsilon-\hat{\hspace{0pt}\epsilon}\rVert^2=\lVert \epsilon\rVert^2-\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\).

  • Testing Error \[\begin{eqnarray*} E\left[\lVert Y_{new}-\hat{\hspace{0pt}Y}\rVert^2\right]&=&E\left[\lVert (f+\epsilon_{new})-(\hat{\hspace{0pt}f}+\hat{\hspace{0pt}\epsilon})\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})+(\epsilon_{new}-\hat{\hspace{0pt}\epsilon})\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})\rVert^2+\lVert(\epsilon_{new}-\hat{\hspace{0pt}\epsilon})\rVert^2-2\langle(f-\hat{\hspace{0pt}f})\ (\epsilon_{new}-\hat{\hspace{0pt}\epsilon})\rangle\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+E\left[\lVert\epsilon_{new}\rVert^2+\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2+p\sigma^2\\ &=&\verb|testing error| = \verb|training error| + 2p\sigma^2 \end{eqnarray*}\]

But for testing error this is different since \(\hat{\hspace{0pt}\epsilon}\) is not projection of \(\epsilon_{new}\):

\[\begin{eqnarray*} E\left[\lVert\epsilon_{new}-\hat{\hspace{0pt}\epsilon}\rVert^2\right]&=&E\left[\lVert\epsilon_{new}\rVert^2\right]+E\left[\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\right] -2E\left[\langle \epsilon_{new}\ \hat{\hspace{0pt}\epsilon}\rangle\right]\\ &=&n\sigma^2+p\sigma^2-0\\ &=&n\sigma^2+p\sigma^2\\ \end{eqnarray*}\]
  • Estimation Error \[\begin{eqnarray*} E\left[\lVert \hat{\hspace{0pt}Y}-f\rVert^2\right]&=&E\left[\lVert (\hat{\hspace{0pt}f}+\hat{\hspace{0pt}\epsilon})-f\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})+\hat{\hspace{0pt}\epsilon}\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})\rVert^2+\lVert(\hat{\hspace{0pt}\epsilon})\rVert^2-2\langle(f-\hat{\hspace{0pt}f})\ \hat{\hspace{0pt}\epsilon}\rangle\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+p\sigma^2\\ &=&\verb|estimation error| = \verb|training error| - n\sigma^2 + 2p\sigma^2 \end{eqnarray*}\]

  • Error of \(\boldsymbol{\beta}_{best}\) \[\begin{eqnarray*} E\left[\lVert Y-\hat{\hspace{0pt}f}\rVert^2\right]&=&E\left[\lVert (f+\epsilon)-\hat{\hspace{0pt}f}\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})+\epsilon\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})\rVert^2+\lVert(\epsilon)\rVert^2-2\langle(f-\hat{\hspace{0pt}f})\ \epsilon\rangle\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2\\ \end{eqnarray*}\] Also we see that error of \(\boldsymbol{\beta}_{best}\) = training error + \(p\sigma^2\).

Training error, estimation error, error of \(\boldsymbol{\beta}_{best}\)and testing error

training error \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2-p\sigma^2\)
estimation error \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2+p\sigma^2\)
error of \(\boldsymbol{\beta}_{best}\) \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2\)
testing error \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2+p\sigma^2\).

Thus, we see that

when \(n>2p\) estimation error \(\leq\) training error \(\leq\) error of \(\boldsymbol{\beta}_{best}\) \(\leq\) testing error
when \(2p>n>p\) training error \(\leq\) estimation error \(\leq\) error of \(\boldsymbol{\beta}_{best}\) \(\leq\) testing error
when \(n=p\) training error \(\leq\) estimation error \(=\) error of \(\boldsymbol{\beta}_{best}\) \(\leq\) testing error
when \(n<p\) training error \(\leq\) error of \(\boldsymbol{\beta}_{best}\) \(\leq\) estimation error \(\leq\) testing error

This shows that the relative size of n to p affects the relative size of the estimation error to the other errors, but the relative size of the estimation error, error of \(\boldsymbol{\beta}_{best}\), and testing error are stable. If we increase \(p\) we will have less training error because of the \(-p\sigma^2\) term, which also increases the testing error by the \(+p\sigma^2\) term.

When you plot the errors against p, you will see that the training error will always go down with an increase in \(p\), but the testing error will go down up to some point and then increase. This is because when you increase \(p\), not only does \(-p\sigma^2\) term decrease the error for training, but also, an increase in \(p\) will reduce the model bias term \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2\). The reduction in error by the reduction in model bias is responsible for the initial decrease in testing error. However, at some point, the decrease in model bias gets overwhelmed by the \(+p\sigma^2\) term, which is the reason testing error increases at some point.

Some thoughts on extreme cases

In extreme cases where \(Y\) is pure noise, that is \(f=0\). Then \(n=p\) and \(\epsilon=\hat{\hspace{0pt}\epsilon}\). Therefore, your training error will be 0, but when you test your model, then \(\epsilon_{new}+\hat{\hspace{0pt}\epsilon}=\epsilon_{new}+\epsilon\) which is more error in general. Generally speaking, you don’t want to be too clever and interpret too many patterns in your error, or else you will be penalized in your test.

Decomposition of estimation error

For general estimator \(\tilde{\hspace{0pt}\boldsymbol{\beta}}\), let \(\hat{\hspace{0pt}f}=\mathbf{X}\boldsymbol{\beta}_{best}\), and let \(\mu=E[\tilde{\hspace{0pt}\boldsymbol{\beta}}]\). Then estimation error can be decomposed into model bias, estimation bias, and estimation variance.

\[\begin{eqnarray*} E\lVert f-\mathbf{X}\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2&=&E\lVert f-\hat{\hspace{0pt}f}+\mathbf{X}\boldsymbol{\beta}_{best}-\mathbf{X}\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\\ &=&\lVert f-\hat{\hspace{0pt}f}\rVert^2+E\lVert \mathbf{X}\boldsymbol{\beta}_{best}-\mathbf{X}\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\\ &=&\lVert f-\hat{\hspace{0pt}f}\rVert^2+\lVert \boldsymbol{\beta}_{best}-\mu \rVert^2 +E\lVert \mu - \tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2 \\ \end{eqnarray*}\]

If the \(\tilde{\hspace{0pt}\boldsymbol{\beta}}=\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}\), that is \(E[\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}]=\boldsymbol{\beta}_{best}\), then \(\lVert \boldsymbol{\beta}_{best}-\mu \rVert^2=0\).

Model bias and estimation bias can be very ambiguous. Suppose there are 20000 predictors, \(p=20000\) and \(n\), that are not quite as large as \(p\). To fit a model in such a situation, there are two ways to go about it, both of which will be discussed later under the topic of regularization. One is to do model selection, where you leave the most significant predictors in the model and exclude the rest. Another is to do a shrinkage estimation where you shrink most of the estimates down to zero. Essentially, you are doing the same thing; however, model selection will increase the model bias, whereas the shrinkage method will increase the estimation bias.

Property of General Estimator Error

  1. For a general estimator of regression coefficients, investigate the relationship between the testing (generalization, prediction) error and training error and identify the effective degrees of freedom.
obs. predictor 1 predictor 2 \(\cdots\) predictor p \(f\) randomness in training training data randomness in testing testing data
1 \(x_{11}\) \(x_{12}\) \(\cdots\) \(x_{1p}\) \(f_1\) \(\epsilon_1\) \(y_1=f_1+\epsilon_1\) \(\epsilon_{1,new}\) \(y_{1,new}=f_1+\epsilon_{1,new}\)
2 \(x_{21}\) \(x_{22}\) \(\cdots\) \(x_{2p}\) \(f_2\) \(\epsilon_2\) \(y_2=f_2+\epsilon_2\) \(\epsilon_{2,new}\) \(y_{2,new}=f_2+\epsilon_{2,new}\)
\(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
n \(x_{n1}\) \(x_{n2}\) \(\cdots\) \(x_{np}\) \(f_n\) \(\epsilon_n\) \(y_n=f_n+\epsilon_n\) \(\epsilon_{n,new}\) \(y_{n,new}=f_n+\epsilon_{n,new}\)
  \(\mathbf{X}_{1}\) \(\mathbf{X}_{2}\) \(\cdots\) \(\mathbf{X}_{p}\) \(f\) \(\epsilon\) \(Y=f+\epsilon\) \(\epsilon_{new}\) \(Y_{new}=f+\epsilon_{new}\)

For general estimator \(\tilde{\hspace{0pt}\boldsymbol{\beta}}=[\tilde{\hspace{0pt}\boldsymbol{\beta}}_1,\cdots,\tilde{\hspace{0pt}\boldsymbol{\beta}}_p]^T\), let \(\hat{\hspace{0pt}f}=X\tilde{\hspace{0pt}\boldsymbol{\beta}}\), and let \(\mu=E[\tilde{\hspace{0pt}\boldsymbol{\beta}}]=[\mu_{\tilde{\hspace{0pt}\boldsymbol{\beta}_1}},\cdots,\mu_{\tilde{\hspace{0pt}\boldsymbol{\beta}_p}}]^T\). Then, estimation error, training error, and testing error can be written as:

  • Estimation Error \[\begin{eqnarray*} E\left[\lVert f-\hat{\hspace{0pt}f}\rVert^2\right] &=& E\left[\lVert f-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right] \end{eqnarray*}\]

  • Training Error \[\begin{eqnarray*} E\left[\lVert Y-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right]&=&E\left[\lVert f+\epsilon-\hat{\hspace{0pt}f}\rVert^2\right]\\ &=&E\left[\rVert f-\hat{\hspace{0pt}f}+\epsilon\rVert^2\right]\\ &=&E\left[\rVert f-\hat{\hspace{0pt}f}\rVert^2+\rVert \epsilon\rVert^2+2\langle f-\hat{\hspace{0pt}f},\ \epsilon \rangle\right]\\ &=&E\left[\rVert f-\hat{\hspace{0pt}f}\rVert^2\right]+E\left[\rVert \epsilon\rVert^2\right]+2E\left[\langle f-\hat{\hspace{0pt}f},\ \epsilon \rangle\right]\\ &=&E\left[\rVert f-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right]+n\sigma^2-2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\\ &=&\verb|estimation error|+n\sigma^2-2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\\ \end{eqnarray*}\]

  • Testing Error \[\begin{eqnarray*} E\left[\rVert Y_{new}-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right]&=&E\left[\rVert f+\epsilon_{new}-\hat{\hspace{0pt}f}\rVert^2\right]\\ &=&E\left[\rVert f-\hat{\hspace{0pt}f}+\epsilon_{new}\rVert^2\right]\\ &=&E\left[\rVert f-\hat{\hspace{0pt}f}\rVert^2+\rVert \epsilon_{new}\rVert^2\right]\\ &=&E\left[\rVert f-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right]+n\sigma^2\\ &=&\verb|estimation error|+n\sigma^2\\ &=&\verb|training error|+2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\\ \end{eqnarray*}\]

Over-fitting

Let’s look closely at the following relationships.

  • \[\mbox{training error} =\mbox{estimation error}+n\sigma^2-2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\]
  • \[\mbox{testing error} = \mbox{training error}+2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\]

You can see that the training error is decreased by \(2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\) term, but the testing error is increased by the same \(2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\) term. Generally speaking, \(E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\) term is like a correlation between your estimated signal \(\hat{\hspace{0pt}f}\) and noise \(\epsilon\). So it can be interpreted as how much noise your estimator \(\tilde{\hspace{0pt}\boldsymbol{\beta}}\) absorbs. Thus, when you fit a model, you want to control this in order to avoid over-fitting.

The effective degrees of freedom

For Least Squares Estimate testing error =training error\(+2p\sigma^2\).If we let \(2p\sigma^2=2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\) we get \(p=\frac{E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle}{\sigma^2}\) hence we see that effective degrees of freedom in \(\tilde{\hspace{0pt}\boldsymbol{\beta}}\) is \(\frac{E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle}{\sigma^2}\).